Chocolate Chocolate

1 Introduction

Hi, I’m Ying! Do you love chocolate? I won’t be surprised if you say “YES”! Chocolate is fantastic: it tastes wonderful and makes people feel happy. According to Healthline, studies show that dark chocolate can improve health and lower the risk of heart disease. So with Christmas around the corner, let’s look at chocolate and uncover the mystery behind this fantastic sweet!

I focus on Chocolate and use the dataset called Chocolate Bar Ratings from https://www.kaggle.com/rtatman/chocolate-bar-ratings.

The dataset is introduced as follows: Each year, residents of the United States collectively eat more than 2.8 billion pounds of chocolate. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate beans used, and where the beans were grown.

About the ratings: A rating here only represents an experience with one bar from one batch. It reflects the overall experience of flavor, texture, and aftermelt of the chocolate.

Using this dataset, I invite you to learn more about charming chocolate!

1.1 Data

In the section below, I explain how I proceeded with this project and how I cleaned the data.

1. Data Sources: To answer my questions, I imported the Chocolate Bar Ratings data (https://www.kaggle.com/rtatman/chocolate-bar-ratings). All the data was downloaded directly from the website, with no pre-cleaning. Everything was done in R, as you can see in the later steps.

2. Data Cleaning: To do the analysis, I first load the libraries and then check the data.

#require(tm)
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
require(lubridate)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
require(skimr)
## Loading required package: skimr
require(kableExtra)
## Loading required package: kableExtra
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
require(ggplot2)
require(RColorBrewer)
## Loading required package: RColorBrewer
library(here)
## here() starts at /Users/cassie/Documents/UT学习/21FA/R/Week14/Final
library(anytime)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)
library(wordcloud)
library(corrplot)
## corrplot 0.91 loaded
#library(tm)

2 Research question

There are lots of interesting things to be found in this dataset.

My personal interests are…

  • Which companies produce good chocolate (I want to buy some for Christmas!)?

  • What is the average rating of the chocolate?

  • Where does good chocolate come from?

  • What is the relationship between cocoa percentage and the quality of chocolate?

  • Can we predict the rating of a chocolate?

Import the dataset and see its structure.

ChocolateData <- read.csv("../data/flavors_of_cacao.csv") 
str(ChocolateData) 
## 'data.frame':    1795 obs. of  9 variables:
##  $ Company...Maker.if.known.       : chr  "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
##  $ Specific.Bean.Origin.or.Bar.Name: chr  "Agua Grande" "Kpime" "Atsane" "Akata" ...
##  $ REF                             : int  1876 1676 1676 1680 1704 1315 1315 1315 1319 1319 ...
##  $ Review.Date                     : int  2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
##  $ Cocoa.Percent                   : chr  "63%" "70%" "70%" "70%" ...
##  $ Company.Location                : chr  "France" "France" "France" "France" ...
##  $ Rating                          : num  3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
##  $ Bean.Type                       : chr  " " " " " " " " ...
##  $ Broad.Bean.Origin               : chr  "Sao Tome" "Togo" "Togo" "Togo" ...
summary(ChocolateData) 
##  Company...Maker.if.known. Specific.Bean.Origin.or.Bar.Name      REF      
##  Length:1795               Length:1795                      Min.   :   5  
##  Class :character          Class :character                 1st Qu.: 576  
##  Mode  :character          Mode  :character                 Median :1069  
##                                                             Mean   :1036  
##                                                             3rd Qu.:1502  
##                                                             Max.   :1952  
##   Review.Date   Cocoa.Percent      Company.Location       Rating     
##  Min.   :2006   Length:1795        Length:1795        Min.   :1.000  
##  1st Qu.:2010   Class :character   Class :character   1st Qu.:2.875  
##  Median :2013   Mode  :character   Mode  :character   Median :3.250  
##  Mean   :2012                                         Mean   :3.186  
##  3rd Qu.:2015                                         3rd Qu.:3.500  
##  Max.   :2017                                         Max.   :5.000  
##   Bean.Type         Broad.Bean.Origin 
##  Length:1795        Length:1795       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
head(ChocolateData, 10)
##    Company...Maker.if.known. Specific.Bean.Origin.or.Bar.Name  REF Review.Date
## 1                   A. Morin                      Agua Grande 1876        2016
## 2                   A. Morin                            Kpime 1676        2015
## 3                   A. Morin                           Atsane 1676        2015
## 4                   A. Morin                            Akata 1680        2015
## 5                   A. Morin                           Quilla 1704        2015
## 6                   A. Morin                         Carenero 1315        2014
## 7                   A. Morin                             Cuba 1315        2014
## 8                   A. Morin                     Sur del Lago 1315        2014
## 9                   A. Morin                   Puerto Cabello 1319        2014
## 10                  A. Morin                          Pablino 1319        2014
##    Cocoa.Percent Company.Location Rating Bean.Type Broad.Bean.Origin
## 1            63%           France   3.75                    Sao Tome
## 2            70%           France   2.75                        Togo
## 3            70%           France   3.00                        Togo
## 4            70%           France   3.50                        Togo
## 5            70%           France   3.50                        Peru
## 6            70%           France   2.75   Criollo         Venezuela
## 7            70%           France   3.50                        Cuba
## 8            70%           France   3.50   Criollo         Venezuela
## 9            70%           France   3.75   Criollo         Venezuela
## 10           70%           France   4.00                        Peru

Now we have the data frame, with 1795 observations and 9 variables, but I don’t like the way it looks.

I want to…

  • Change the column names to be more readable

  • Convert the columns to the proper data types

  • Deal with missing values

  • Delete “REF”, since I will not use this variable

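# Give the columns shorter, more readable names
colnames(ChocolateData) <- c("Company", "BarOrigin", "REF", "ReviewDate", "CocoaPct", "Loc", "Rating", "Type", "BeanOrigin")
# Strip the percent sign and convert the cocoa percentage to numeric
ChocolateData$CocoaPct <- gsub("[%]", "", ChocolateData$CocoaPct)
ChocolateData$CocoaPct <- as.numeric(ChocolateData$CocoaPct)
# Trim whitespace in Type and BeanOrigin, then mark empty strings as NA
ChocolateData[, c(8,9)] <- sapply(ChocolateData[,c(8,9)], str_trim)
is.na(ChocolateData) <- ChocolateData==''
# Drop REF (column 3), which is not used in this analysis
ChocolateData <- ChocolateData[, -3]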

str(ChocolateData)
## 'data.frame':    1795 obs. of  8 variables:
##  $ Company   : chr  "A. Morin" "A. Morin" "A. Morin" "A. Morin" ...
##  $ BarOrigin : chr  "Agua Grande" "Kpime" "Atsane" "Akata" ...
##  $ ReviewDate: int  2016 2015 2015 2015 2015 2014 2014 2014 2014 2014 ...
##  $ CocoaPct  : num  63 70 70 70 70 70 70 70 70 70 ...
##  $ Loc       : chr  "France" "France" "France" "France" ...
##  $ Rating    : num  3.75 2.75 3 3.5 3.5 2.75 3.5 3.5 3.75 4 ...
##  $ Type      : chr  NA NA NA NA ...
##  $ BeanOrigin: chr  "Sao Tome" "Togo" "Togo" "Togo" ...
head(ChocolateData, 10)
##     Company      BarOrigin ReviewDate CocoaPct    Loc Rating    Type BeanOrigin
## 1  A. Morin    Agua Grande       2016       63 France   3.75    <NA>   Sao Tome
## 2  A. Morin          Kpime       2015       70 France   2.75    <NA>       Togo
## 3  A. Morin         Atsane       2015       70 France   3.00    <NA>       Togo
## 4  A. Morin          Akata       2015       70 France   3.50    <NA>       Togo
## 5  A. Morin         Quilla       2015       70 France   3.50    <NA>       Peru
## 6  A. Morin       Carenero       2014       70 France   2.75 Criollo  Venezuela
## 7  A. Morin           Cuba       2014       70 France   3.50    <NA>       Cuba
## 8  A. Morin   Sur del Lago       2014       70 France   3.50 Criollo  Venezuela
## 9  A. Morin Puerto Cabello       2014       70 France   3.75 Criollo  Venezuela
## 10 A. Morin        Pablino       2014       70 France   4.00    <NA>       Peru

Cool, I like how it looks now.
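The same cleaning steps can be tried out on a tiny made-up data frame before running them on the full file. This is only a sketch in base R: the three rows below are hypothetical and are not taken from the dataset, and `raw` and `clean` are names I made up for illustration.

```r
# Hypothetical three-row stand-in for the raw CSV (values are made up)
raw <- data.frame(
  Cocoa.Percent = c("63%", "70%", "72.5%"),
  Bean.Type     = c(" ", "Criollo", "  Trinitario "),
  stringsAsFactors = FALSE
)

clean <- raw
# Strip the percent sign, then convert the text to numbers
clean$Cocoa.Percent <- as.numeric(sub("%", "", clean$Cocoa.Percent, fixed = TRUE))
# Trim surrounding whitespace, then turn empty strings into NA
clean$Bean.Type <- trimws(clean$Bean.Type)
clean$Bean.Type[clean$Bean.Type == ""] <- NA

str(clean)
```

Note that `trimws()` by default only strips spaces, tabs and newlines. If the blanks in a real file turn out to be other kinds of whitespace, `stringr::str_trim()` as used above may be the safer choice.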

Now we have 8 variables. Let’s look at the distributions of these variables. I will plot the categorical variables as bar charts showing the most common values and see what we can find.

#top companies
Top_companies <- ChocolateData %>%
  group_by(Company) %>% 
  summarise(Count= n())%>%
  top_n(10, wt = Count) %>%
  arrange(desc(Count))

ggplot(Top_companies, aes(reorder(Company, Count),  Count, fill = Count)) + 
  coord_flip() +
  geom_bar(stat = "identity", size = 0.1)+xlab("Top_Companies")

#Review_cocoa_date
Review_cocoa <- ChocolateData %>%
  group_by(ReviewDate) %>% 
  summarise(Count = n())

ggplot(Review_cocoa, aes(x = factor(ReviewDate), y = Count, fill = Count)) + 
  geom_bar(stat = "identity", size = 0.1) +
  xlab("Review Date") +
  coord_flip()

#BarOrigin
BarOrigin_new <- ChocolateData %>%
  group_by(BarOrigin) %>% 
  summarise(Count = n()) %>% 
  top_n(10, wt = Count) %>%
  arrange(desc(Count))

ggplot(BarOrigin_new, aes(reorder(BarOrigin, Count), Count, fill = Count)) + 
  coord_flip() +
  geom_bar(stat = "identity", size = 0.1) + xlab("Top_BarOrigin")

#BeanTypes
BeanTypes <- ChocolateData %>%
  group_by(Type) %>% 
  na.omit() %>%
  summarise(Count = n()) %>% 
  mutate(pct = Count/sum(Count)) %>% 
  top_n(10, wt = pct)

ggplot(BeanTypes, aes(x = reorder(Type, pct), y = pct, fill = pct)) + 
  geom_bar(stat = "identity", size = 0.1) + 
  coord_flip() +
  xlab("Bean_Type") + ylab("Percentage")
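The group-and-count pattern used in the chunks above can also be written more compactly. The sketch below uses `count()` and `slice_max()` from dplyr (`slice_max()` is the newer replacement for `top_n()` in dplyr 1.0.0 and later). The `toy` data frame is made up for illustration and is not part of the dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for ChocolateData
toy <- data.frame(
  Company = c("Soma", "Soma", "Soma", "Bonnat", "Bonnat", "Fresco")
)

# Count rows per company, sort by count, and keep the 2 most frequent
top_companies <- toy %>%
  count(Company, name = "Count", sort = TRUE) %>%
  slice_max(Count, n = 2, with_ties = FALSE)

top_companies
```

With `with_ties = FALSE` the result always has exactly `n` rows, whereas `top_n()` keeps every tied row, so the two can differ when several groups share the same count.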

So we can find…

  • The top 3 companies are Soma, Bonnat, Fresco
  • The top 3 review years are 2015, 2014, 2016 (What happened in 2015?)
  • The top 3 bar origins are Madagascar, Peru, Ecuador
  • The top 3 bean types are Trinitario, Criollo, Forastero (I omitted all NA values)

After finishing these charts, I have a basic understanding of the data. Now I will go to my questions and try to find answers.

3 Which companies produce good chocolate

For my first question: Which companies produce good chocolate (I want to buy some for Christmas!)?

Company_rating <- ChocolateData %>%
  group_by(Company) %>%
  summarize(rating = mean(Rating), count = n()) %>%
  arrange(desc(count),desc(rating));

head(Company_rating, n = 10)
## # A tibble: 10 × 3
##    Company                    rating count
##    <chr>                       <dbl> <int>
##  1 Soma                         3.59    47
##  2 Bonnat                       3.44    27
##  3 Fresco                       3.38    26
##  4 Pralus                       3.28    25
##  5 A. Morin                     3.38    23
##  6 Arete                        3.53    22
##  7 Domori                       3.48    22
##  8 Guittard                     3.17    22
##  9 Valrhona                     3.33    21
## 10 Hotel Chocolat (Coppeneur)   2.97    19
Companies <- ChocolateData %>%
  group_by(Company) %>% 
  filter(n() > 10) %>% 
  mutate(avg = mean(Rating))

Companies  %>%
  ggplot(aes(x = reorder(as.factor(Company), Rating, FUN = mean), y = Rating)) + 
  geom_point(aes(x = as.factor(Company), y = avg, colour = avg)) + 
  geom_count(alpha = .1) + 
  coord_flip() + 
  labs(x = "Company", y = "Rating") 

Taking both the number of reviews and the average rating into consideration, Soma is No. 1. Amedei is excellent, but its sample size is rather small. Therefore, I think high-quality chocolate that is also easy to find is most likely to come from Soma. Let’s buy Soma!
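One way to weigh review count and average rating together is to require a minimum number of reviews and then sort by average rating. The sketch below shows that idea on made-up data. The company names "A", "B" and "C", the ratings, and the `min_reviews` threshold are all assumptions for illustration, not values from the dataset.

```r
library(dplyr)

# Hypothetical toy ratings; the company names and numbers are made up
toy <- data.frame(
  Company = c("A", "A", "A", "B", "B", "B", "C"),
  Rating  = c(3.5, 3.75, 4.0, 3.0, 3.25, 3.5, 4.0)
)

min_reviews <- 3  # assumed threshold, adjust to taste

# Average rating per company, keeping only companies with enough reviews
ranked <- toy %>%
  group_by(Company) %>%
  summarise(rating = mean(Rating), count = n(), .groups = "drop") %>%
  filter(count >= min_reviews) %>%
  arrange(desc(rating))

ranked
```

Here company "C" has the highest single rating but is dropped because it was reviewed only once, which is the same reasoning applied to the small-sample companies above.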

4 What is the average rating of the chocolate?

After finding the company, I want to understand how chocolate ratings are distributed overall. So I draw a bar chart of the rating distribution.

ggplot(ChocolateData, aes(factor(Rating))) + 
  geom_bar(fill = "steelblue") + 
  xlab("Rating") 

summary(ChocolateData$Rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.875   3.250   3.186   3.500   5.000

Flavors of Cacao Rating System:

  • 5 = Elite (Transcending beyond the ordinary limits)
  • 4 = Premium (Superior flavor development, character and style)
  • 3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
  • 2 = Disappointing (Passable but contains at least one significant flaw)
  • 1 = Unpleasant (mostly unpalatable)

Most of the ratings are between 2.75 and 3.75, and I’m happy to see that many ratings lie around 3.5, which falls in the Satisfactory to Praiseworthy band! This makes me feel more confident about buying chocolate at random from any company without worrying too much about its flavor. Also, the mean rating is 3.186 and the median is 3.25, which answers the question of what the average chocolate rating is.

However, the numbers alone are not very intuitive, so I want to make them more readable by adding notes. First, I arrange the ratings into 5 groups.

Rating_Pct_Com <- data.frame(RatingLev = c("Unpleasant","Disappointing","Satisfactory-Praiseworthy","Premium","Elite"),
 Rating = c("1 <= Rating < 2", "2 <= Rating < 3", "3 <= Rating <= 3.75", "3.75 < Rating < 5", "Rating = 5"), 
 # Notes listed in the same order as RatingLev, from lowest to highest
 Note =  c("Unpleasant (mostly unpalatable)",
    "Disappointing (Passable but contains at least one significant flaw)",
    "Satisfactory(3.0) to praiseworthy(3.75) (well made with special qualities)",
    "Premium (Superior flavor development, character and style)",
    "Elite (Transcending beyond the ordinary limits)"))

kbl(Rating_Pct_Com, caption = "Chocolate Rating Description") %>%
kable_classic(html_font = "Cambria", full_width = FALSE)
Chocolate Rating Description
RatingLev                  Rating               Note
Unpleasant                 1 <= Rating < 2      Unpleasant (mostly unpalatable)
Disappointing              2 <= Rating < 3      Disappointing (Passable but contains at least one significant flaw)
Satisfactory-Praiseworthy  3 <= Rating <= 3.75  Satisfactory(3.0) to praiseworthy(3.75) (well made with special qualities)
Premium                    3.75 < Rating < 5    Premium (Superior flavor development, character and style)
Elite                      Rating = 5           Elite (Transcending beyond the ordinary limits)
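The five groups can also be assigned to individual ratings in code. The helper below is a hypothetical sketch in base R (the name `rating_level` is mine, not from the original analysis). It assumes the boundaries 1 to under 2, 2 to under 3, 3 to 3.75, above 3.75 to under 5, and exactly 5.

```r
# Hypothetical helper: map numeric ratings to the five descriptive levels
rating_level <- function(rating) {
  lev <- c("Unpleasant", "Disappointing", "Satisfactory-Praiseworthy",
           "Premium", "Elite")
  # Pick the index of the first band each rating falls into
  idx <- ifelse(rating < 2, 1,
         ifelse(rating < 3, 2,
         ifelse(rating <= 3.75, 3,
         ifelse(rating < 5, 4, 5))))
  factor(lev[idx], levels = lev)
}

# Made-up example ratings covering every band
example_ratings <- c(1, 2.75, 3, 3.75, 4, 5)
table(rating_level(example_ratings))
```

Applied to the real data, something like `table(rating_level(ChocolateData$Rating))` would give the number of bars in each group.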

Where does good chocolate come from?

Next, I want to find out where good chocolate comes from. First, I create a word cloud of the company locations, because I have always liked word clouds and they look really cool! Then I draw a boxplot to see the relationship between company location and rating.

word_choc <- gsub(" ", "",ChocolateData$Loc)
wordcloud(word_choc, max.words = 200, random.order = FALSE, scale = c(4,0.7), rot.per = 0.5, colors = brewer.pal(8, "Dark2"))
## Loading required namespace: tm
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents

This is cool: we can see the top countries visually. Do these countries have the best chocolate? I want to create a boxplot to see the distribution of ratings by country. I decided to display only those countries that have been rated more than 5 times.

ChocolateData %>%
  group_by(Loc) %>% 
  filter(n() > 5) %>% 
  mutate(avg = mean(Rating)) %>%
  ggplot() + 
  geom_boxplot(aes(reorder(Loc, avg), Rating, fill = avg)) + 
  scale_fill_continuous(low = "#132B43", high = "#56B1F7", name = "Average rating") + 
  coord_flip() + 
  labs(x = "Company Location", y = "Rating") 

summary(ChocolateData$Rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.875   3.250   3.186   3.500   5.000

So we can tell that good chocolate comes from Australia, Switzerland, Italy and Canada.

5 What is the relationship between cocoa percentage and the quality of chocolate?

Let’s investigate whether there is a relationship between cocoa percentage and a chocolate’s rating. My assumption: the higher the cocoa percentage, the more bitter the chocolate tastes.

ChocolateData%>%
  ggplot(aes(x = Rating, y = CocoaPct)) +
  geom_point() +
  labs(x = "Rating", y ="Cocoa percentage" ) + 
  geom_smooth(method = "lm", se = FALSE, col = "brown")
## `geom_smooth()` using formula 'y ~ x'

From the chart, I can see that there is no strong linear relationship between cocoa percentage and chocolate rating. So I try to go deeper.

model_1 <- lm(formula = Rating ~ CocoaPct, data = ChocolateData)

summary(model_1)
## 
## Call:
## lm(formula = Rating ~ CocoaPct, data = ChocolateData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2071 -0.3196  0.0429  0.3178  1.7929 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.079388   0.126757  32.183  < 2e-16 ***
## CocoaPct    -0.012461   0.001761  -7.076 2.12e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4717 on 1793 degrees of freedom
## Multiple R-squared:  0.02717,    Adjusted R-squared:  0.02662 
## F-statistic: 50.07 on 1 and 1793 DF,  p-value: 2.122e-12

As we can see, the adjusted R-squared is only 0.02662, so cocoa percentage explains less than 3% of the variation in rating, even though the slope is statistically significant (p-value 2.122e-12). A simple linear model is therefore not a good way to describe the relationship between cocoa percentage and chocolate rating. Still, the negative slope, like the chart, implies that the higher the cocoa percentage, the lower the rating tends to be, which is the opposite of what I assumed.
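To get a feel for how small the slope is, the fitted line can be evaluated by hand. The sketch below simply plugs the intercept and slope reported in the summary above into a small function. `predict_rating` is a hypothetical helper name, and the two cocoa percentages are arbitrary examples.

```r
# Coefficients copied from the summary(model_1) output above
intercept <- 4.079388
slope     <- -0.012461

# Hypothetical helper: rating predicted by the fitted line
predict_rating <- function(cocoa_pct) intercept + slope * cocoa_pct

# Predicted ratings for a 70% bar and an 85% bar: about 3.21 and 3.02
predict_rating(c(70, 85))

# Difference implied by 15 extra percentage points of cocoa
predict_rating(85) - predict_rating(70)
```

So even a jump from 70% to 85% cocoa moves the predicted rating by less than 0.2 points, which is small next to the residual standard error of 0.4717.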

6 Can we predict the rating of a chocolate?

If we want to find a model to help with predicting the rating of the chocolate, we need to go back to the main dataset and add more variables into consideration.

chocolate_cor <- data.frame(ChocolateData$ReviewDate, ChocolateData$CocoaPct, ChocolateData$Rating)
names(chocolate_cor)[1:3] <- c("Review Date", "Cocoa Percent", "Rating")

chocolate_cor <- round(cor(chocolate_cor), 3)

chocolate_cor %>%
  kbl(caption = "The relationship between Rating, Review date and Cocoa Percent", digits = 3) %>% 
  kable_classic(html_font = "Cambria", full_width = FALSE)
The relationship between Rating, Review date and Cocoa Percent
               Review Date  Cocoa Percent  Rating
Review Date          1.000          0.038   0.100
Cocoa Percent        0.038          1.000  -0.165
Rating               0.100         -0.165   1.000
Relations <- corrplot(chocolate_cor, method = 'circle', type = 'upper', tl.srt = 30)

We can see from the chart that there are no strong correlations between these variables: there is a slightly negative correlation between Rating and Cocoa Percent (-0.165) and a very slightly positive one between Review Date and Rating (0.100).
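The correlation of -0.165 and the R-squared of 0.02717 from the linear model describe the same thing: in a regression with a single predictor, R-squared equals the squared correlation. The sketch below checks this identity on made-up numbers (the `x` and `y` vectors are hypothetical) and then squares the reported correlation.

```r
# Made-up data, only to illustrate the identity R^2 = r^2
x <- c(60, 65, 70, 75, 80, 85)
y <- c(3.5, 3.25, 3.5, 3.0, 3.25, 2.75)

r  <- cor(x, y)                      # Pearson correlation
r2 <- summary(lm(y ~ x))$r.squared   # R-squared of the simple regression

# TRUE: the two quantities agree up to floating-point error
all.equal(r^2, r2)

# Squaring the reported correlation gives roughly the reported R-squared
(-0.165)^2
```

The small gap between 0.027225 and 0.02717 comes from the correlation having been rounded to three decimals.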

7 Conclusion

7.0.1 Which companies produce good chocolate (I want to buy some for Christmas!)?

Buy Soma!

7.0.2 What is the average rating of the chocolate?

Pretty good! Most of the ratings are between 2.75 and 3.75, the mean rating is 3.186, and the median is 3.25.

7.0.3 Where does good chocolate come from?

In general, Canada and Italy are better choices. Even though the USA ranks first in number of companies, it does not produce the highest-rated chocolate.

7.0.4 What is the relationship between cocoa percentage and the quality of chocolate?

It is hard to say that there is a clear relationship between these two variables; if anything, it is slightly negative. I guess this is because the higher the cocoa percentage, the more bitter the chocolate, and most people like a sweet taste. Choosing a cocoa percentage of 70%-75% seems the most likely way to get awesome chocolate.

7.0.5 Can we predict the rating of a chocolate?

From my analysis, it is hard to say that any of these factors helps much in predicting the rating of a chocolate. You are more likely to get a higher-rated chocolate if you choose a cocoa percentage of 70%-75%. It is even better if the company is from Australia, Switzerland, Italy, or Canada. To make things easier, buy Soma.

Furthermore, I think bean type and bar origin will play an important role in a chocolate’s rating. I will go deeper and use these variables to explore the relationship if I have a chance.

Thanks for going through this chocolate review journey with me! Nevertheless, there are limitations to this simple research. I did not go very deep and only used a few common variables to describe the relationships. The latest data is from 2017, about four years ago. How is the situation now? Will Covid influence the transport of cocoa and therefore the production of chocolate? I have no answers to these questions. I hope I will be able to answer them someday by continuing my study.

8 Bibliography

1. Photo, https://www.pinterest.com/pin/586734657714494204

2. Healthline, “7 Proven Health Benefits of Dark Chocolate,” https://www.healthline.com/nutrition/7-health-benefits-dark-chocolate?epik=dj0yJnU9bXYyeWVSd3ZMb0dFQjg4ZXg3ZGl0QjRWdU9YLWM5ZWkmcD0wJm49NEpQM3lVWmNtQ2kzY1lMNDdfYkUwQSZ0PUFBQUFBR0d0amU4#TOC_TITLE_HDR_9

3. Jason Horn, “What Is the Difference Between Bittersweet and Semisweet Chocolate?” https://greatist.com/eat/what-is-the-difference-between-bittersweet-chocolate-and-semisweet-chocolate

4. Coffee Quality database from CQI, https://www.kaggle.com/volpatto/coffee-quality-database-from-cqi

5. Chocolate Bar Ratings, https://www.kaggle.com/rtatman/chocolate-bar-ratings

Note: I got a lot of inspiration from No. 4 and No. 5. After I finished the first version of my code, I made some adjustments based on their code and logic. I truly appreciate them, but all the code in my final version was written by me.